
Add CUDA process checkpointing helpers#1983

Open
kkraus14 wants to merge 10 commits into NVIDIA:main from kkraus14:kk/issue-1343-cuda-checkpointing

Conversation

@kkraus14
Collaborator

@kkraus14 kkraus14 commented Apr 28, 2026

Summary

  • add a dedicated cuda.core.checkpoint module for CUDA process checkpointing APIs while keeping cuda.core.system focused on CUDA system/NVML capabilities
  • expose a narrow runtime API via checkpoint.Process(pid): read-only pid, state, restore_thread_id, lock, checkpoint, restore, and unlock
  • keep the checkpoint module public surface limited to Process; the state return type lives in cuda.core.typing.ProcessStateT and is rendered in the private API docs
  • map Process.state from the CUDA driver CUprocessState enumerators rather than raw integer values
  • support restore-time GPU UUID remapping using either driver CUuuid values or Device.uuid strings; migration docs and tests now describe the stricter kernel-mode-driver visibility requirement rather than user-space CUDA visibility
  • document the coordinator/target-process model, Linux permission requirements such as CAP_SYS_PTRACE, the CRIU/CPU-process-image boundary, restore-thread requirement, and persistence mode/cuInit restore requirement
  • validate checkpoint API availability lazily and cache the successful check, covering the cuda-bindings version, required binding symbols, and CUDA driver version
  • consolidate checkpoint driver call handling in one boundary that translates missing checkpoint symbols and unsupported checkpoint CUDA results into a checkpoint-specific RuntimeError
  • re-enable checkpoint driver coverage in CI by running driver-backed lifecycle tests in isolated coordinator/target subprocess scenarios with parent-side timeouts; migration tests skip when the CUDA device view is masked because the mapping cannot be proven KMD-complete
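The coordinator-side lifecycle summarized above can be sketched as follows. The method names (`lock`, `checkpoint`, `restore`, `unlock`) and the read-only `state` attribute come from the bullets in this summary; the `_FakeProcess` stub below is purely illustrative, standing in for a real driver-backed `checkpoint.Process(pid)` so the call order can be shown without a GPU.

```python
def checkpoint_restore_cycle(proc, gpu_mapping=None):
    """Drive a target process through the documented lifecycle:
    running -> locked -> checkpointed -> locked -> running."""
    proc.lock()                            # freeze CUDA API calls in the target
    proc.checkpoint()                      # copy GPU state into a CPU-side image
    proc.restore(gpu_mapping=gpu_mapping)  # optionally remap GPUs by UUID
    proc.unlock()                          # let the target resume running
    return proc.state


# Minimal in-memory stand-in used here only to illustrate the call order;
# a real coordinator would construct cuda.core.checkpoint.Process(target_pid).
class _FakeProcess:
    def __init__(self):
        self.state = "running"
        self.calls = []

    def lock(self):
        self.calls.append("lock")
        self.state = "locked"

    def checkpoint(self):
        self.calls.append("checkpoint")
        self.state = "checkpointed"

    def restore(self, gpu_mapping=None):
        self.calls.append("restore")
        self.state = "locked"

    def unlock(self):
        self.calls.append("unlock")
        self.state = "running"


p = _FakeProcess()
final_state = checkpoint_restore_cycle(p)
```

The stub also makes the strict state machine visible: restore returns the process to the locked state, and only unlock returns it to running.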

Closes #1343

Testing

  • git commit -S pre-commit hooks: ruff, formatting, SPDX, whitespace, RST, and related checks passed
  • git diff --check
  • pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped)
  • pixi run --manifest-path cuda_core -e docs docs-build-latest (Sphinx build succeeded)
  • previous broader local run: pixi run --manifest-path cuda_core pytest cuda_core/tests --ignore=cuda_core/tests/cython (2798 passed, 352 skipped, 2 failed)

The two Python-suite failures in the broader run are existing local NVML/system environment failures and are not related to this checkpointing change:

  • cuda_core/tests/system/test_system_device.py::test_get_inforom_version returns an empty InfoROM board part number locally.
  • cuda_core/tests/system/test_system_system.py::test_get_process_name hits an NVML UTF-8 decode error locally.

Additional local build/test notes:

  • pixi run --manifest-path cuda_core test stops before pytest in the existing build-cython-tests pre-step because cuda_core/tests/cython/test_get_cuda_native_handle.pyx cannot find the expected cuda.bindings .pxd files in this local pixi environment.
  • pixi build from cuda_core reaches the existing native cuda-core extension build and then fails with CUDA 12.9 headers that do not declare CU_MEM_ALLOCATION_TYPE_MANAGED; this is in the existing graph extension build path and is not checkpoint-specific.

CI note:

  • The previous CI attempt on 8192df67 exposed two unrelated runner/environment issues: one Windows py3.12 build failed inside the shared mini-CTK cache setup before any cuda.core build step, and the CUDA 13.x GPU test jobs were canceled after the old in-process checkpoint test hung in cuCheckpointProcessCheckpoint.
  • The current head removes the broad CI skip. Driver-backed checkpoint lifecycle tests now run through isolated subprocess coordinator/target scenarios, and the parent pytest process can kill and skip a scenario that times out instead of letting the CI job hang.

Current Test Implementation

The checkpoint tests in cuda_core/tests/test_checkpoint.py are real driver/GPU tests, not broad mocks.

Input validation and public-symbol checks run everywhere. Driver-backed lifecycle tests create a target process that initializes a real CUDA context, then a coordinator scenario calls checkpoint.Process(target.pid) and exercises state, restore_thread_id, lock, checkpoint, restore, and unlock through the real driver. The parent pytest process enforces a timeout around each scenario so unsupported driver/hardware paths skip cleanly instead of hanging the test job.
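The parent-side timeout described above can be sketched with the standard library. This is a simplified illustration, not the PR's actual harness (which the description says kills the whole scenario process group); `run_scenario` is a hypothetical helper name.

```python
import subprocess
import sys


def run_scenario(code, timeout_s=30):
    """Run a coordinator/target scenario in a child Python process.
    If a driver call stalls, kill the child and report the timeout
    instead of hanging the whole test job."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return None  # the caller turns this into a pytest skip


# A fast scenario completes normally; a hanging one is killed and yields None.
rc_ok = run_scenario("import time; time.sleep(0.1)", timeout_s=10)
rc_hang = run_scenario("import time; time.sleep(30)", timeout_s=1)
```

Returning a sentinel instead of raising keeps the decision (fail vs. skip) in the parent test, which matches the skip-on-unsupported-path behavior described above.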

Migration tests require at least two same-chip GPUs and an unmasked CUDA device view. They build full UUID mappings using Device.uuid strings, then exercise rotation and pair-swap migration patterns through Process.restore(gpu_mapping=...) in the isolated target process. They skip gracefully when CUDA_VISIBLE_DEVICES is set, when the local hardware lacks a same-chip GPU pair, or when the driver rejects/no-ops checkpoint migration.
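The rotation and pair-swap patterns mentioned above can be illustrated over plain strings. In the real tests the keys and values would be Device.uuid strings and the resulting dict is what gets passed to Process.restore(gpu_mapping=...); the "gpu-a" style values below are placeholders, not real UUIDs.

```python
def rotation_mapping(uuids):
    """Map each GPU UUID to the next one, wrapping around (rotation)."""
    return {u: uuids[(i + 1) % len(uuids)] for i, u in enumerate(uuids)}


def pair_swap_mapping(uuids):
    """Swap adjacent pairs: (0,1), (2,3), ... Each pair is assumed same-chip."""
    mapping = {}
    for i in range(0, len(uuids) - 1, 2):
        mapping[uuids[i]] = uuids[i + 1]
        mapping[uuids[i + 1]] = uuids[i]
    return mapping


uuids = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # stand-ins for Device.uuid strings
rot = rotation_mapping(uuids)
swap = pair_swap_mapping(uuids)
```

Note that both helpers produce a mapping covering every input UUID, which matters because (per the KMD-visibility rule above) a partial mapping is rejected by the driver.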

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cuda.core label (Everything related to the cuda.core module) on Apr 28, 2026
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 396a2ca to 7c66b2f on April 28, 2026 16:28
@kkraus14
Collaborator Author

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

/ok to test 7c66b2f

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch 2 times, most recently from 779c697 to 82f816c on April 28, 2026 16:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 82f816c to 25455d8 on April 28, 2026 18:22
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 25455d8 to aaf1418 on April 28, 2026 19:14
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from aaf1418 to d8a2031 on April 28, 2026 20:24
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 marked this pull request as ready for review April 29, 2026 13:59
@kkraus14 kkraus14 added the feature label (New feature or request) on Apr 29, 2026
@kkraus14 kkraus14 added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@kkraus14 kkraus14 self-assigned this Apr 29, 2026
@rparolin rparolin requested review from leofang and rparolin April 29, 2026 17:44
leofang and others added 5 commits May 2, 2026 04:06
Replace the entire mock-based test suite with real GPU tests that
exercise the CUDA driver checkpoint API directly:

- Input validation: pid type/range, public symbol checks
- Lifecycle (single GPU): state transitions at every stage
  (running→locked→checkpointed→locked→running), restore_thread_id,
  lock/unlock, lock with timeout, full checkpoint-restore cycle
- GPU migration: rotation mapping and same-chip swap following the
  r580-migration-api.c pattern; gracefully skip when the driver does
  not support migration (CUDA_ERROR_INVALID_VALUE — NVBug 5437334)

The self_process fixture wraps os.getpid() and safety-unlocks on
teardown if the test fails mid-lifecycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- checkpoint._make_restore_args now accepts UUID strings (as returned
  by Device.uuid) in addition to CUuuid objects, via a new _as_cuuuid
  helper that converts "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" strings
  to CUuuid using ctypes.
- Tests no longer import cuda.bindings.driver; all device queries use
  cuda.core.Device (Device().uuid for current device, Device.uuid for
  mapping keys/values, Device.get_all_devices() for enumeration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
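The string-to-CUuuid conversion this commit describes can be sketched with the standard-library uuid module. The exact `_as_cuuuid` signature is internal to the PR, so this only illustrates the byte-level step: a canonical UUID string becomes the 16 raw bytes a driver CUuuid struct carries.

```python
import uuid


def uuid_string_to_bytes(s):
    """Convert a canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" UUID
    string to its 16 raw bytes, the payload a CUuuid struct would hold."""
    return uuid.UUID(s).bytes


raw = uuid_string_to_bytes("12345678-1234-5678-1234-567812345678")
```

A real helper would additionally populate the ctypes CUuuid structure from these bytes, as the commit message notes.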
Ruff import sorting, ruff format, and noqa annotation for
best-effort teardown in the self_process fixture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The swap migration test calls set_current() on a different device.
Record the initial device from init_cuda and restore it on teardown
so tests are side-effect free.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator Author

@kkraus14 kkraus14 left a comment


Addressed the current CUDA checkpointing review comments in signed commit 8192df6.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Review follow-up notes for signed commit 8192df67:

GitHub accepted the inline reply for the restore-mapping xref thread, but the remaining inline reply attempts are currently returning GitHub server errors, so I am summarizing the responses here.

  • ProcessStateT: moved to cuda.core.typing, included in typing.__all__, and listed in api_private.rst. cuda.core.checkpoint imports it privately and does not expose it in checkpoint.__all__ or as checkpoint.ProcessStateT.
  • Process state mapping: updated to use the actual driver.CUprocessState.CU_PROCESS_STATE_* enumerators as keys rather than raw Python ints.
  • Checkpoint docs: rewrote the section to describe the coordinator/target PID model, show os.getpid() only as a self-checkpoint example, call out Linux permissions including CAP_SYS_PTRACE, explain the CRIU/CPU-process-image boundary, and document restore-thread plus persistence mode/cuInit requirements.
  • Tests: replaced the broad mock suite with real driver/GPU tests. The lifecycle tests self-checkpoint the pytest process through lock, checkpoint, restore, and unlock; migration tests rotate/swap full CUDA-visible Device.uuid string mappings on suitable same-chip multi-GPU systems and skip unsupported hardware/driver cases.
  • Mapping semantics: docs and tests now use full CUDA-visible mappings for migration. Tests also skip at call time when a driver exposes checkpoint symbols but returns CUDA_ERROR_NOT_SUPPORTED for actual checkpoint operations.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 4, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8192df6

@leofang
Member

leofang commented May 4, 2026

Copy-pasta from my bot, with internal info redacted.

What I did

  • Discovered GPU migration doesn't work on this machine (heterogeneous GPUs + known driver bugs)
  • Searched internal resources (Confluence, gdrive, NVBugs, P4) to understand the root cause
  • Rewrote the full test suite with real GPU tests and no mocks
  • Pushed 4 commits to Keith's PR branch

Test suite structure (13 tests)

  • 5 input validation (no GPU needed): invalid pid types/values, public symbols
  • 6 lifecycle (single GPU, real driver): state transitions at every step, restore_thread_id, lock/unlock with timeouts, full checkpoint/restore cycle
  • 2 migration (≥2 same-chip GPUs): rotation and swap — exercise real driver API, skip gracefully when unsupported

Lessons learned

  1. nvidia-smi order ≠ CUDA device order. We wasted time testing the wrong GPU pair because CUDA_VISIBLE_DEVICES=0,3 selected Ada + A6000 (different chips) instead of two A6000s. Always use Device.uuid to identify GPUs, never assume nvidia-smi indices match CUDA indices.

  2. CUDA_VISIBLE_DEVICES breaks checkpoint migration. Tested locally: even a 2-pair identity mapping fails with CUDA_ERROR_INVALID_VALUE when only 2 of 4 GPUs are visible, while the same 4-pair identity mapping succeeds with all GPUs visible. The CUDA driver docs state the mapping must cover "every checkpointed GPU," and the driver records all physically attached GPUs at checkpoint time — not just the CUDA_VISIBLE_DEVICES subset.

  3. Migration requires exact chip match. Tested locally: any mapping that remaps between different architectures (e.g. Ada ↔️ A6000) is rejected with CUDA_ERROR_INVALID_VALUE. Same-chip mappings (A6000 ↔️ A6000) are accepted. The public CUDA docs say "the GPU to restore onto needs to be of the same chip type as the old GPU."

  4. Migration may be a no-op on some driver versions. Tested locally: same-chip A6000 swap is accepted (no error) but the context device UUID doesn't change — the driver silently no-ops. This is consistent with NVBug 5437334 (api_reverse_gpu_pairs fails on GA100x4) and NVBug 5544504 (customer report: "restore on a different GPU — the checkpointed process does not appear in the GPU process list").

  5. Checkpoint state machine is strict. Can't unlock from "checkpointed" state — must restore first (CUDA_ERROR_ILLEGAL_STATE). Can't corrupt GPU memory between checkpoint and restore within the same process (CUDA APIs are frozen). The overwrite-then-restore scenario requires an external coordinator (CRIU).

  6. Use cuda.core APIs in tests, not raw bindings. Device().uuid for current device, Device.uuid for mapping keys, Device.get_all_devices() for enumeration. The implementation (checkpoint.py) handles the string-to-CUuuid conversion internally.
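Selecting a same-chip GPU pair, as lessons 1 and 3 recommend, can be sketched by grouping devices on their chip name. The (name, uuid) tuples below are stand-ins for cuda.core Device objects; real code would pull both fields from Device, never from nvidia-smi ordering.

```python
from collections import defaultdict


def find_same_chip_pair(devices):
    """Group (chip_name, uuid) tuples by chip and return the first pair
    of UUIDs that share a chip, or None when no such pair exists."""
    by_chip = defaultdict(list)
    for name, dev_uuid in devices:
        by_chip[name].append(dev_uuid)
    for uuids in by_chip.values():
        if len(uuids) >= 2:
            return uuids[0], uuids[1]
    return None


# Placeholder inventory mirroring the mixed Ada/A6000 machine described above.
devices = [
    ("NVIDIA RTX 6000 Ada", "gpu-0"),
    ("NVIDIA RTX A6000", "gpu-1"),
    ("NVIDIA RTX A6000", "gpu-3"),
]
pair = find_same_chip_pair(devices)
```

Grouping on the device name is a heuristic for "same chip"; a stricter check could compare compute capability or PCI device IDs as well.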

@leofang
Member

leofang commented May 4, 2026

Pushed e9c03de because the real tests hang in the CI...

@leofang
Member

leofang commented May 4, 2026

/ok to test e9c03de

@leofang leofang requested review from Andy-Jost and leofang May 4, 2026 17:35
leofang and others added 2 commits May 4, 2026 16:49
cuCheckpointProcessCheckpoint hangs on CI runners (ephemeral VM +
container), causing all CUDA 13.x test jobs to time out.  Skip the
tests that call into the checkpoint driver when the CI environment
variable is set.  Input validation tests still run everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from e9c03de to 8f798f4 on May 4, 2026 20:50
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8f798f4

@leofang
Member

leofang commented May 4, 2026

@kkraus14 I noticed your force-push (to stringify the tests for subprocess) still includes my WAR (checking the CI env var), so the tests are still not run in the CI environment. Is that intended?

Member

@leofang leofang left a comment


Since I also made some changes, would be nice for @Andy-Jost to re-review 🙂

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Addressed Leo’s latest review comments in signed commit 376acc7f18.

On CI: the remaining CI skip was a temporary workaround, not the intended final state. I removed the broad CI skip after moving all driver-backed lifecycle tests onto the isolated subprocess coordinator/target harness. The parent pytest process enforces timeouts and kills the scenario process group on timeout, so CI should get checkpoint coverage without repeating the previous job-level hang. Migration tests still skip when CUDA_VISIBLE_DEVICES is set or when the hardware/driver cannot provide valid same-chip migration, because the restore mapping must cover the KMD-visible GPU set.

Other updates:

  • Process.pid is now read-only via private _pid storage plus a property, with test coverage.
  • _call_driver now owns checkpoint missing-symbol and unsupported-result translation in one place.
  • Docs and migration tests now describe/use the KMD visibility rule instead of the looser CUDA-visible wording.
  • The PR description has been updated for the current implementation and validation state.

Local validation passed: ruff, git diff --check, focused pytest (10 passed, 6 skipped), and docs-build-latest.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 376acc7


Labels

  • cuda.core: Everything related to the cuda.core module
  • feature: New feature or request
  • P1: Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CUDA Checkpointing

4 participants